Introduction

This document will examine data related to Covid 19 confirmed cases and deaths in the U.S. Data provided by the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE). More information regarding this data, its sources and caveats can be found here.

Exploratory Analysis of COVID-19 Data

The COVID-19 pandemic has generated an unprecedented amount of data. This notebook performs an exploratory data analysis on the time series data compiled by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University.

The primary goal of this analysis is to process, explore, and visualize the provided datasets for confirmed cases and covid related deaths, both in the US and globally. By doing so, we aim to uncover trends, compare the impact across different regions, and gain a clearer understanding of the pandemic’s progression over time. This exploration will serve as a foundation for asking more specific questions and potentially building predictive models.


Data Collection

The data in this repository is stored in separate csv files between confirmed cases and deaths per county.

Additionally, the data is in a wide format with individual dates as columns. In this initial data load we will also pivot both tables into a “tall” format, coerce date columns into an appropriate date datatype, and fix a small issue where the Kansas City FIPS code is missing.

————————–BEFORE PIVOT———————

        Number of rows in the Confirmed Table: 3342 

        Number of columns in the Confirmed Table: 1154 

        Number of rows in the Deaths Table: 3342 

        Number of columns in the Deaths Table: 1155 

————————–AFTER PIVOT———————

        Number of rows in the Confirmed Table: 3819906 

        Number of columns in the Confirmed Table: 5 

        Number of rows in the Deaths Table: 3819906 

        Number of columns in the Deaths Table: 6 

Notice: There is one additional Column in the deaths dataset. This reperesents the population of each county and this will be used in a later join.


Confirmed Cases by County

Sample of data after pivot.

knitr::kable(tail(confirmed_us), caption = "Confirmed Covid Cases by county")
Confirmed Covid Cases by county
Province_State Country_Region Combined_Key date cases
Wyoming US Weston, Wyoming, US 3/4/23 1905
Wyoming US Weston, Wyoming, US 3/5/23 1905
Wyoming US Weston, Wyoming, US 3/6/23 1905
Wyoming US Weston, Wyoming, US 3/7/23 1905
Wyoming US Weston, Wyoming, US 3/8/23 1905
Wyoming US Weston, Wyoming, US 3/9/23 1905

Deaths by County Table

Sample of data after pivot.

knitr::kable(tail(deaths_us), caption = "Deaths from Covid by county")
Deaths from Covid by county
Province_State Country_Region Combined_Key Population date deaths
Wyoming US Weston, Wyoming, US 6927 3/4/23 23
Wyoming US Weston, Wyoming, US 6927 3/5/23 23
Wyoming US Weston, Wyoming, US 6927 3/6/23 23
Wyoming US Weston, Wyoming, US 6927 3/7/23 23
Wyoming US Weston, Wyoming, US 6927 3/8/23 23
Wyoming US Weston, Wyoming, US 6927 3/9/23 23

Population Density Table

In order to use this data to discover statistics regarding how covid effected areas with greater or lesser population density, first we must load a dataset with data on the square land mass per county.

This we can get from the US Census website as a .txt file.

Issue: New Connecticut County Codes for 2024

https://www.bls.gov/cew/classifications/areas/new-2024-connecticut-counties.htm

In an attempt to get land mass for Connecticut counties, I found that the county codes were updated for 2024. Given the data is from 2020-2023, It is easier to import the 2021 Gazetteer file and then update the county codes for Connecticut.

Build a County level population and land area dataset

This will be joined to the covid dataset to calculate case density and death density.

population_df <- deaths_us_raw %>%
    select(FIPS, Combined_Key, Population) %>%
    distinct()

population_df <- population_df %>%
    left_join(df_gaz_counties %>% select(GEOID, ALAND_SQMI),
            by = c("FIPS" = "GEOID")) %>%
    mutate(density = Population / ALAND_SQMI)

knitr::kable(head(population_df), caption = "Population and Land Area per county table")
Population and Land Area per county table
FIPS Combined_Key Population ALAND_SQMI density
1001 Autauga, Alabama, US 55869 594.456 93.98341
1003 Baldwin, Alabama, US 223234 1589.836 140.41323
1005 Barbour, Alabama, US 24686 885.008 27.89353
1007 Bibb, Alabama, US 22394 622.470 35.97603
1009 Blount, Alabama, US 57826 644.891 89.66787
1011 Bullock, Alabama, US 10101 622.815 16.21830

Joining the 3 tables

In this final step in the data preperation, we will join the confirmed_us, deaths_us and population_df tables together.

Next we need to remove several records from the data that won’t be useful to us. These are areas that do not fit into a US county designation. In these cases the county name begins with Out of <state> or Unassigned. A quick check shows very little population in these areas and it is safe to simply remove them from this analysis.

Finally, we can add to manipulated columns at this point to help us in the next sections. We will create a feature that calcutates the “growth” or “new” case/deaths per day by simply subtacting the difference from the previous day.

# Convert date to proper Date format in both datasets before joining
confirmed_us <- confirmed_us %>%
    mutate(date = mdy(date, quiet = FALSE))

deaths_us <- deaths_us %>%
    mutate(date = mdy(date, quiet = FALSE))

covid_df <- confirmed_us %>%
    full_join((deaths_us))
## Joining with `by = join_by(Province_State, Country_Region, Combined_Key, date)`
covid_df <- covid_df %>%
    left_join((population_df))
## Joining with `by = join_by(Combined_Key, Population)`
covid_df <- covid_df %>%
    filter(!str_starts(Combined_Key, "Out of|Unassigned"))

# Calculate a "new" value for daily values
covid_df <- covid_df %>%
    group_by(Combined_Key) %>%
    arrange(date) %>%
    mutate(
        new_cases = cases - lag(cases, default = 0),
        new_deaths = deaths - lag(deaths, default = 0)
    ) %>%
    ungroup()

Final Covid Data

The final prepared data.

————————Summary—————–

        Number of rows in the Covid Table: 3701034 

        Number of columns in the Covid Table: 12 
Table continues below
Province_State Country_Region Combined_Key date
Length:3701034 Length:3701034 Length:3701034 Min. :2020-01-22
Class :character Class :character Class :character 1st Qu.:2020-11-02
Mode :character Mode :character Mode :character Median :2021-08-15
NA NA NA Mean :2021-08-15
NA NA NA 3rd Qu.:2022-05-28
NA NA NA Max. :2023-03-09
NA NA NA NA
Table continues below
cases Population deaths FIPS
Min. : 0 Min. : 0 Min. : 0.0 Min. : 60
1st Qu.: 409 1st Qu.: 11004 1st Qu.: 5.0 1st Qu.:19025
Median : 2409 Median : 26113 Median : 40.0 Median :30019
Mean : 14446 Mean : 102803 Mean : 189.2 Mean :31375
3rd Qu.: 8438 3rd Qu.: 67215 3rd Qu.: 126.0 3rd Qu.:46105
Max. :3710586 Max. :10039107 Max. :35545.0 Max. :99999
NA NA NA NA’s :10287
Covid DF prepared
ALAND_SQMI density new_cases new_deaths
Min. : 2.05 Min. : 0.04 Min. :-27000.00 Min. :-591.0000
1st Qu.: 419.04 1st Qu.: 17.20 1st Qu.: 0.00 1st Qu.: 0.0000
Median : 604.56 Median : 46.21 Median : 1.00 Median : 0.0000
Mean : 1097.79 Mean : 296.55 Mean : 27.86 Mean : 0.2989
3rd Qu.: 914.96 3rd Qu.: 135.75 3rd Qu.: 10.00 3rd Qu.: 0.0000
Max. :145575.56 Max. :71882.16 Max. :110441.00 Max. :2806.0000
NA’s :18288 NA’s :18288 NA NA
Province_State Country_Region Combined_Key date cases Population deaths FIPS ALAND_SQMI density new_cases new_deaths
Illinois US McLean, Illinois, US 2021-05-12 18114 171517 178 17113 1183.248 144.954397 28 0
Illinois US Bureau, Illinois, US 2020-11-21 1903 32628 33 17011 869.071 37.543538 41 1
North Dakota US Towner, North Dakota, US 2020-07-21 4 2189 0 38095 1025.113 2.135374 0 0
Michigan US Cheboygan, Michigan, US 2020-02-27 0 25276 0 26031 715.279 35.337260 0 0
Kansas US Greenwood, Kansas, US 2021-12-31 1195 5982 19 20073 1143.304 5.232204 0 0

Introductory Exploration of COVID-19 Data

In the following sections we will explore some of the key trends and patterns in the COVID-19 data. This will include visualizations of case and death counts over time and comparisons between regions.

US Total View

First we will look at an aggregated view of the United States and view the relationship between Covid Cases and Covid Deaths. To do this we will create a new table grouping all counties by date to get a national view.

Additionally, We will convert this US aggregated data into weekly granularity for better visualization and to compare with areas that only reported weekly.

# Group all counties to get a national view by day
us_covid_df <- covid_df %>%
    group_by(Country_Region, date) %>%
    arrange(date) %>%
    summarize(population = sum(Population),
        cases = sum(cases),
        deaths = sum(deaths),
        new_cases = sum(new_cases),
        new_deaths = sum(new_deaths),
        .groups = 'drop') %>%
    mutate(deaths_per_mill = (deaths * 1000000 / population)) %>%
    select(Country_Region, date, cases, deaths, deaths_per_mill, population, new_cases, new_deaths) %>%
    ungroup()

# Convert US data to weekly aggregation for better visualization and to compare with areas that only reported weekly.
us_covid_weekly <- us_covid_df %>%
    mutate(week = floor_date(date, "week")) %>%
    group_by(Country_Region, week) %>%
    summarize(
        population = first(population),
        cases = last(cases),  # Take last value of week for cumulative
        deaths = last(deaths),
        new_cases = sum(new_cases),  # Sum daily new cases for the week
        new_deaths = sum(new_deaths),
        .groups = 'drop'
    ) %>%
    mutate(deaths_per_mill = (deaths * 1000000 / population))

US Visualiztions

# Time series plot of new cases and new deaths for the US
us_timeseries <- plot_ly(us_covid_weekly, x = ~week) %>%
    add_lines(y = ~new_cases, name = "New Cases", 
              line = list(color = "blue")) %>%
    add_lines(y = ~new_deaths, name = "New Deaths", 
              line = list(color = "red"), 
              yaxis = "y2") %>%
    layout(
        title = "US COVID-19: New Cases and Deaths Over Time",
        xaxis = list(title = "Date"),
        yaxis = list(
            title = "New Cases",
            side = "left",
            titlefont = list(color = "blue"),
            tickfont = list(color = "blue")
        ),
        yaxis2 = list(
            title = "New Deaths",
            side = "right",
            overlaying = "y",
            titlefont = list(color = "red"),
            tickfont = list(color = "red")
        ),
        hovermode = "x unified"
    )

us_timeseries
Reactions

The intial view shows the relationship between covid cases and covid deaths. While there is an unfortunate spike in cases and deaths in January of 2022, we can also see that the ratio of cases to deaths has also decreased by this time. This could be a result of hospitals and communities being better equipped to care for covid patients or the effectiveness of the vacine reduce the mortality rate of the virus.

US Aggregated

Next we will look at the same visual in a cumulative form.

# Time series plot of cases and deaths for the US
us_timeseries_total <- plot_ly(us_covid_df, x = ~date) %>%
    add_lines(y = ~cases, name = "Cases", 
              line = list(color = "blue")) %>%
    add_lines(y = ~deaths, name = "Deaths", 
              line = list(color = "red"), 
              yaxis = "y2") %>%
    layout(
        title = "US COVID-19: Cases and Deaths Over Time",
        xaxis = list(title = "Date"),
        yaxis = list(
            title = "Cases",
            side = "left",
            titlefont = list(color = "blue"),
            tickfont = list(color = "blue")
        ),
        yaxis2 = list(
            title = "New Deaths",
            side = "right",
            overlaying = "y",
            titlefont = list(color = "red"),
            tickfont = list(color = "red")
        ),
        hovermode = "x unified"
    )

us_timeseries_total
Reactions

This view helps further us understand the relationship between cases and deaths. The gap between the too lines showing us a timeframe where the virus was more deadly.


State Vizualizations

# Group counties by state to get a state wide view by date
state_covid_df <- covid_df %>%
    group_by(Province_State, date) %>%
    arrange(date) %>%
    summarize(population = sum(Population),
        cases = sum(cases),
        deaths = sum(deaths),
        new_cases = sum(new_cases),
        new_deaths = sum(new_deaths),
        .groups = 'drop') %>%
    mutate(deaths_per_mill = (deaths * 1000000 / population)) %>%
    select(Province_State, date, cases, deaths, deaths_per_mill, population, new_cases, new_deaths) %>%
    ungroup()

# Convert state data to weekly aggregation
state_covid_weekly <- state_covid_df %>%
    mutate(week = floor_date(date, "week")) %>%
    group_by(Province_State, week) %>%
    summarize(
        population = first(population),
        cases = last(cases),
        deaths = last(deaths),
        new_cases = sum(new_cases),
        new_deaths = sum(new_deaths),
        .groups = 'drop'
    ) %>%
    mutate(deaths_per_mill = (deaths * 1000000 / population))
# Function to plot state COVID-19 time series
plot_state_covid <- function(state_name, data = state_covid_weekly) {
    # Filter data for the specified state
    state_data <- data %>% 
        filter(Province_State == state_name)
    
    # Check if state exists in data
    if(nrow(state_data) == 0) {
        stop(paste("State", state_name, "not found in data."))
    }
    
    # Create the plot
    state_plot <- plot_ly(state_data, x = ~week) %>%
        add_lines(y = ~new_cases, name = "New Cases", 
                  line = list(color = "blue")) %>%
        add_lines(y = ~new_deaths, name = "New Deaths", 
                  line = list(color = "red"), 
                  yaxis = "y2") %>%
        layout(
            title = paste(state_name, "COVID-19: New Cases and Deaths Over Time"),
            xaxis = list(title = "Date"),
            yaxis = list(
                title = "New Cases",
                side = "left",
                titlefont = list(color = "blue"),
                tickfont = list(color = "blue")
            ),
            yaxis2 = list(
                title = "New Deaths",
                side = "right",
                overlaying = "y",
                titlefont = list(color = "red"),
                tickfont = list(color = "red")
            ),
            hovermode = "x unified"
        )
    
    return(state_plot)
}

Next we will look at this same visualization of the relationship between cases and deaths, but at a state level. Clearly some states were exposed to the virus earlier than others and perhaps had a much different response to the epidemic.

We won’t examine every state here but get a sense of the variance between several states.

First we will examine the state of New York which saw the earliest spike in cases and deaths.

New York

# Plot NY
plot_state_covid("New York")
Reaction

New York saw the earliest exposure to the virus across the country. Given the density of the city and our lack of understand both in care and transmission, we can see an outsized impact on the number of deaths relative to the cases.

However, By the time the omicron variant surfaced, New York was better prepared for care and its citizens had a higher rate of vacinations relative to the country. Thus, while the covid cases saw another unfortunate spike, this time nearly 5 times higher than any previous spike, the death toll was almost equally 5 times lower than its peak ratio.

Next we will look at another state with fewer overall cases to examine the time and growth rates

Alabama

plot_state_covid('Alabama')
Reaction

Alabama in many shows a similar pattern of a high ratio of deaths upon the first wave of the virus and ultimately much lower fatalities during the high case spike in January of 2022. However, the first wave of infections did come nearly 6 months after New York. Unfortunately the state’s medical centers or perhaps its citizens were not able to prepare from NY’s learnings. Further, Alabama sees another spike in September of 2021 that is once again highly fatal.

Next lets examine a state with large population but less density

Illinois

plot_state_covid('Illinois')
Reaction

Illinois again has some similarities to New York and Alabama, but certainly suffered more later in the epidemic. Unlike Alabama, Illinois was apart of the first wave of the virus which was very deadly given the lack of understanding and preparation. Then again in January of 2021, Illinois suffered a spike in cases, yet unlike New York was not able to reduce the fatalities relative to the cases. Finally unlike Alabama and New York, the omicron variant in January of 2022 continued to be very fatal to Illinois residents.

Finally lets examine an outlier with large population yet low death counts

Florida

plot_state_covid('Florida')
Reaction

Unsure of the patterns here as Florida looks to be an outlier in how the virus effected residents. Through the first 20 months of the virus the fatality rates are very low compared to confirmed cases. However, the Omicron variant spike in January of 2022 did turn out to be very fatal.

Another intersting note on the data from Florida is that even though these charts were moved to a weekly granularity there are sawtooth data patterns in March of 2022. This likely indicates reporting shifts to some counties or hospitals only reporting every 2 weeks. Something to be aware of in the data.


Examine the impacts of Population Density on COVID-19 Cases, Deaths, and Growth Rates

In the next sections, we will analyze how population density influences the spread and severity of COVID-19. We will look at correlations between population density and case counts, death counts, and growth rates of infections across different counties and regions. This analysis will help us understand whether more densely populated areas experienced higher rates of infection and mortality during the pandemic.

Growth Rate

In order to examine we will look at a rolling 14-day average growth of cases and deaths per 100,000 people. This will give us a per capita view of how quickly the virus spread in different areas relative to their population size.

# use the zoo library with `rollmean` function
covid_df <- covid_df %>%
    group_by(Combined_Key) %>%
    arrange(date) %>%
    mutate(
        cases_14d_avg = rollmean(new_cases, k = 14, fill = NA, align = "right"),
        deaths_14d_avg = rollmean(new_deaths, k = 14, fill = NA, align = "right")
    ) %>%
    ungroup()
max_covid_df <- covid_df %>%
    group_by(Combined_Key) %>%
    summarize(
        Province_State = first(Province_State),
        Country_Region = first(Country_Region),
        Population = first(Population),
        FIPS = first(FIPS),
        ALAND_SQMI = first(ALAND_SQMI),
        density = first(density),
        max_date = max(date, na.rm = TRUE),
        total_cases = max(cases, na.rm = TRUE),
        total_deaths = max(deaths, na.rm = TRUE),
        peak_new_cases = max(new_cases, na.rm = TRUE),
        peak_new_deaths = max(new_deaths, na.rm = TRUE),
        peak_cases_14d_avg = max(cases_14d_avg, na.rm = TRUE),
        peak_deaths_14d_avg = max(deaths_14d_avg, na.rm = TRUE)
    ) %>%
    mutate(
        cases_per_capita = ifelse(Population == 0 | is.na(Population), 
                                     0, 
                                     total_cases*100000/Population),
        deaths_per_capita = ifelse(Population == 0 | is.na(Population), 
                                     0, 
                                     total_deaths*100000/Population),
        peak_avg_cases_growth_per_capita = ifelse(Population == 0 | is.na(Population), 
                                     0, 
                                     peak_cases_14d_avg*100000/Population),
        peak_avg_deaths_growth_per_capita = ifelse(Population == 0 | is.na(Population), 
                                     0, 
                                     peak_deaths_14d_avg*100000/Population),
        peak_avg_death_rate = ifelse(Population == 0 | is.na(Population), 
                                     0, 
                                     peak_deaths_14d_avg*100000/Population)) %>%
    ungroup()

Peak Data

pander(head(max_covid_df %>% select('Combined_Key', 'total_cases', 'total_deaths', 'peak_new_cases', 'peak_cases_14d_avg', 'peak_deaths_14d_avg', 'cases_per_capita' ,'deaths_per_capita','peak_avg_cases_growth_per_capita', 'peak_avg_deaths_growth_per_capita', 'peak_avg_death_rate')))
Table continues below
Combined_Key total_cases total_deaths peak_new_cases
Abbeville, South Carolina, US 7826 78 509
Acadia, Louisiana, US 18944 311 786
Accomack, Virginia, US 9119 119 365
Ada, Idaho, US 160373 1139 1451
Adair, Iowa, US 1805 52 48
Adair, Kentucky, US 7849 115 171
Table continues below
peak_cases_14d_avg peak_deaths_14d_avg cases_per_capita
83.57 0.4286 31908
181.1 2 30533
96.57 0.8571 28218
856.9 6.571 33301
9.929 0.5714 25238
42.57 1.143 40876
Table continues below
deaths_per_capita peak_avg_cases_growth_per_capita
318 340.7
501.2 292
368.2 298.8
236.5 177.9
727.1 138.8
598.9 221.7
peak_avg_deaths_growth_per_capita peak_avg_death_rate
1.747 1.747
3.223 3.223
2.652 2.652
1.365 1.365
7.99 7.99
5.952 5.952

Population Density and Death Rate

In this first visual we will examind the population density on a scatter plot.

In addition to the position on the visual indicating the ratio of population to land mass, we have also added additional information. The size of the marker indicates the Total Cases per Capita, and the color indicates the how fatal the cases were with the peak ratio of deaths during the viruses most aggressive period.

What I can see from this visual is that it is not necessarily high population density that is driving peaks in fatalities or even covid cases per capita.

Peak cases

This chart looks at Population Density on the Y axis and the peak new case growth per capita on the X axis.

Peak New Case Growth is not immediately intuitive, but what we are looking to answer is:

“Did the virus spread faster in counties with high population density?”

In the above visual we do not see a linear trend between the x and y variables. Meaning that Population density was not directly correlated with how fast the virus spread.

Additionally, what we might see in this data visualization is a trend that the deaths per 100k residents was higher in low density areas. I think a reasonable hypothesis to explore here would be access to care for the confirmed cases. Larger cities with higher density also have more hospitals and clinics closer to its residents

Deaths Per capita

Were deaths per capita greater in high density areas?

This visual also denies any obvious linear correlation to population density and the deaths per capita.

Death Growth Rate by Density

Is the peak growth rate of deaths (per capita) related to population density? In other words were areas with high population density overcome by a wave of the virus that they wer unable to handle? In this case we are looking at a death rate of total cases to deaths.

This view does present a relationship between Population Density and the peak death rate. Counter-intuitively we see that counties with lower population densities in fact are more correlated to strong peaks in death rate per capita.


Modeling Features

Next we will quantify any correlation with a linear model to see if density or another feathure has an significant relationship to our target

Peak Avg Death Rate

The death rate is calculated as the ratio of deaths to cases. In other words, how well were the community and medical facilaties able to handle Covid cases by prevent fatalities. We calculate this as a rolling average of the growth figure to look for spikes in the impact. Meaning was there a spike in cases that also resulted in a high death rate per case.

We look back over time and find the peak of this ratio and compare it variables such as population, density, total cases or peak new case growth

model <- lm(peak_avg_death_rate ~ Population + density + total_cases + peak_new_cases,
             data = max_covid_df)

# Highlight significant p-values in the model summary
tidy_model <- broom::tidy(model)

# Create color vector that handles NA values
p_value_colors <- ifelse(is.na(tidy_model$p.value), "black",
                         ifelse(tidy_model$p.value < 0.05, "red", "black"))

tidy_model %>%
  knitr::kable(caption = "Model Summary", digits = 4) %>%
  kable_styling(bootstrap_options = c("striped", "hover")) %>%
  column_spec(5, bold = TRUE, color = p_value_colors) %>%
  footnote(general = "Red values indicate statistical significance (p < 0.05)")
Model Summary
term estimate std.error statistic p.value
(Intercept) 4.7186 0.0719 65.6462 0.0000
Population 0.0000 0.0000 -7.0097 0.0000
density 0.0000 0.0000 -0.6803 0.4964
total_cases 0.0000 0.0000 5.3157 0.0000
peak_new_cases 0.0000 0.0000 -0.0711 0.9433
Note:
Red values indicate statistical significance (p < 0.05)

We can see that with this target variable of Peak Death Rate the overall population and total cases were correlated, but population has a negative coefficient. Further, we see that density and peak new case is not statistically significant.

Next, we will fit a model to look for correlation between both population and density to Deaths per Capita

model <- lm(deaths_per_capita ~ Population + density,
             data = max_covid_df)

# Highlight significant p-values in the model summary
tidy_model <- broom::tidy(model)

# Create color vector that handles NA values
p_value_colors <- ifelse(is.na(tidy_model$p.value), "black",
                         ifelse(tidy_model$p.value < 0.05, "red", "black"))

tidy_model %>%
  knitr::kable(caption = "Model Summary", digits = 4) %>%
  kable_styling(bootstrap_options = c("striped", "hover")) %>%
  column_spec(5, bold = TRUE, color = p_value_colors) %>%
  footnote(general = "Red values indicate statistical significance (p < 0.05)")
Model Summary
term estimate std.error statistic p.value
(Intercept) 417.4168 3.3759 123.6479 0.000
Population -0.0001 0.0000 -6.4356 0.000
density -0.0037 0.0019 -1.9436 0.052
Note:
Red values indicate statistical significance (p < 0.05)

Finally, here again, we can see that Population density was not as significantly correlated to deaths per capita.

Conclusion

This analysis examined COVID-19 data across US counties to understand the relationship between population density and virus outcomes. Counter to initial expectations, the data reveals that population density was not a primary driver of COVID-19 severity.

Key Findings

  1. Density and Spread: No clear linear relationship exists between population density and the rate of virus transmission. High-density urban areas did not consistently experience faster spread than rural counties.

  2. Inverse Mortality Pattern: Counties with lower population density showed higher peak death rates per capita. This suggests that proximity to healthcare facilities in urban areas may have provided a protective effect that outweighed the risks of density.

  3. Temporal Patterns: States like New York, which faced early exposure, showed dramatically improved outcomes during later waves, likely due to better treatment protocols and higher vaccination rates. In contrast, states like Alabama and Illinois continued to see elevated fatality ratios throughout the pandemic.

  4. Statistical Significance: Linear models confirmed that while total population and case counts correlated with outcomes, population density itself was not a statistically significant predictor of deaths per capita.


Potential Bias

The original source of the dataset has an LONG list of nuances and caveats to this dataset and its collection. https://github.com/CSSEGISandData/COVID-19.

I will attempt to add a few potential biases of my own here and should be considered when interperting this analysis

Reporting Inconsistencies

The data shows potential reporting bias across different jurisdictions:

  • Florida’s sawtooth pattern in March 2022 suggests irregular reporting intervals (some counties reporting bi-weekly rather than daily)
  • Testing availability: Counties with better healthcare infrastructure likely had higher testing rates, leading to more confirmed cases but potentially lower mortality rates due to earlier detection
  • Testing Decline: as the virus became less lethal, fewer individuals used hospitals or clinics to be diagnosed and therefore confirmed cases are likely underreported in the last months of this data.

Access to Healthcare

Our analysis may be biased by:

  • Rural vs. Urban disparities: Lower density areas may have had less access to testing, leading to underreported cases spiking the ratio to deaths
  • Hospital proximity: The negative correlation between density and death rates may reflect better access to medical care in urban areas rather than inherent differences in virus transmission
  • Socioeconomic factors: Not accounted for in this analysis - income, insurance coverage, and employment type likely influenced both exposure and outcomes

Timeline Effects

  • Vaccine availability: Later waves occurred after vaccine rollout, confounding our density analysis
  • Variant differences: Different COVID variants had varying mortality rates, not captured in our density comparison

FIPS Code Issues

  • Manual correction of Kansas City FIPS code suggests potential data quality issues elsewhere in the dataset

Session Info

pander(sessionInfo())

R version 4.3.3 (2024-02-29)

Platform: x86_64-pc-linux-gnu (64-bit)

locale: LC_CTYPE=en_US.UTF-8, LC_NUMERIC=C, LC_TIME=en_US.UTF-8, LC_COLLATE=en_US.UTF-8, LC_MONETARY=en_US.UTF-8, LC_MESSAGES=en_US.UTF-8, LC_PAPER=en_US.UTF-8, LC_NAME=C, LC_ADDRESS=C, LC_TELEPHONE=C, LC_MEASUREMENT=en_US.UTF-8 and LC_IDENTIFICATION=C

attached base packages: stats, graphics, grDevices, utils, datasets, methods and base

other attached packages: zoo(v.1.8-14), kableExtra(v.1.4.0), plotly(v.4.11.0), pander(v.0.6.6), details(v.0.4.0), lubridate(v.1.9.4), forcats(v.1.0.1), stringr(v.1.5.2), dplyr(v.1.1.4), purrr(v.1.1.0), readr(v.2.1.5), tidyr(v.1.3.1), tibble(v.3.3.0), ggplot2(v.4.0.0) and tidyverse(v.2.0.0)

loaded via a namespace (and not attached): gtable(v.0.3.6), xfun(v.0.54), bslib(v.0.9.0), htmlwidgets(v.1.6.4), lattice(v.0.22-7), tzdb(v.0.5.0), vctrs(v.0.6.5), tools(v.4.3.3), crosstalk(v.1.2.2), generics(v.0.1.4), parallel(v.4.3.3), pkgconfig(v.2.0.3), data.table(v.1.17.8), RColorBrewer(v.1.1-3), S7(v.0.2.0), desc(v.1.4.3), lifecycle(v.1.0.4), compiler(v.4.3.3), farver(v.2.1.2), textshaping(v.1.0.4), htmltools(v.0.5.8.1), sass(v.0.4.10), yaml(v.2.3.10), lazyeval(v.0.2.2), pillar(v.1.11.1), crayon(v.1.5.3), jquerylib(v.0.1.4), cachem(v.1.1.0), tidyselect(v.1.2.1), digest(v.0.6.37), stringi(v.1.8.7), fastmap(v.1.2.0), grid(v.4.3.3), cli(v.3.6.5), magrittr(v.2.0.4), broom(v.1.0.10), clipr(v.0.8.0), withr(v.3.0.2), scales(v.1.4.0), backports(v.1.5.0), bit64(v.4.6.0-1), timechange(v.0.3.0), rmarkdown(v.2.30), httr(v.1.4.7), bit(v.4.6.0), png(v.0.1-8), hms(v.1.1.4), evaluate(v.1.0.5), knitr(v.1.50), viridisLite(v.0.4.2), rlang(v.1.1.6), Rcpp(v.1.1.0), glue(v.1.8.0), xml2(v.1.4.1), svglite(v.2.2.2), rstudioapi(v.0.17.1), vroom(v.1.6.6), jsonlite(v.2.0.0), R6(v.2.6.1) and systemfonts(v.1.3.1)